Text File | 1991-06-09 | 8KB | 209 lines

NAME

smooth - split linear smoothing

SYNOPSIS

smooth [file] [options]

USAGE

By default, SMOOTH reads pairs of numbers (x- and y-values) from the
standard input (or the given file), fits a smooth curve to the points,
and writes points from the smooth curve to the standard output.

Two smoothing algorithms are available. By default, the curve is
calculated using the "lowness" procedure developed by W. S. Cleveland
(see below). This technique achieves robustness by decreasing the
weights of data points that lie far from the fitted line. An alternate
procedure due to Art Owen is also provided (with the -s switch). This
technique smooths the data while preserving sharp discontinuities in
slope or value.

As with GRAPH, each pair of numbers may optionally be followed by a
comment (label). If the comment is surrounded by quotes "...", it may
contain spaces. The given points, and their comments if any, will be
included in the output. The smoothing may optionally be restarted
after each label, so that a family of curves may be processed together
(see the -b switch).

Input lines starting with ";" are copied to the beginning of the
output file but are otherwise ignored. Blank lines are ignored.

OPTIONS

Options can appear anywhere on the command line.

    -a [step [start]]   automatic abscissas
    -b                  break smooth after each label
    -c                  general curve
    -f <num>            for "lowness", the fraction of points to use
                        for each fitted value (default .5)
    -n <num>            for "lowness", the number of points to use for
                        each fitted value (default 50% of the points)
    -r                  print residuals rather than smoothed values
    -s                  split linear fit rather than "lowness"
    -xl                 take logs of x values before smoothing
    -yl                 take logs of y values before smoothing
    -zl                 take logs of z values before smoothing
                        (implies -3)
    -3                  3D case: x, y, and z given for each point

If the -c switch is not used, the input points must be from a
function - that is, the x values must be strictly increasing. The
output points will also be from a function.
(If the -b switch is used, this restriction applies only within each
segment.)

If the -c switch is used (indicating a general curve), the input
points need not be from a function, but each point must be separated
from the previous point by a nonzero distance. (If the -b switch is
used, this restriction applies only within each segment.)

The -f or -n switch designates the number of data points used to
calculate a given smoothed value. The larger the number, the smoother
the resulting curve. It is not possible to specify the number in terms
of a range of the independent variable (e.g. a "time constant").
Therefore, these methods are appropriate when the density of data
points is approximately constant, or else the density is higher in the
"interesting" (i.e. rapidly changing) part of the curve.

The distinction between the -f and -n switches becomes useful only
when there are several data sets. Suppose one had two data sets
covering the same range of the independent variable, and that one set
had twice as many data points as the other. For equivalent treatment,
one could smooth the two sets with the same value for the -f switch.
On the other hand, suppose two sets of data have data points at the
same density, but one set covers twice the range of the independent
variable (and therefore has twice as many data points). For equivalent
smoothing, one could use the -n switch with the same value in each
case.

For general curves, the given x and y (and z, if present) values are
regarded as functions of the distance along a smoothed path. This
doesn't work very well for split linear smoothing, since it tends to
conceal abrupt changes in position. However, the split linear smooth
is still able to preserve abrupt changes in the first derivative.

METHODS

Lowness by W. S. Cleveland, and split linear fit by A. Owen

Lowness

Robust locally weighted regression is a method for smoothing a
scatterplot, (x[i], y[i]), i=1,...,n, in which the fitted value at
x[k] is the value of a polynomial fit to the data using weighted least
squares, where the weight for (x[i], y[i]) is large if x[i] is close
to x[k]. Robustness is added by calculating residuals and repeating
the procedure with reduced weights on points with large residuals.
(This procedure is generally known in the literature as LOWESS.)

Reference: W. S. Cleveland, "Robust Locally Weighted Regression and
Smoothing Scatterplots", Journal of the American Statistical
Association, v74, n368, p829 (Dec 79)

Split Linear Smoothing Algorithm

Given: a list of window sizes, SizeList, and n pairs (x[i],y[i])
sorted on x.
Returns: the split linear smooth of y on x.

The general technique is due to Art Owen, who offers this discussion:

"You should feel free to experiment with the algorithm, since it has
some ad hoc parts. The essentials are: to use uncentered windows of
varying sizes along with the central ones, to get zero weight on the
worst fitting lines, and to make the weight attached to a particular
line size and orientation vary smoothly as one traverses the data. We
tried to find a simple way to meet all of these goals; the algorithm
we settled on was the simplest that worked for us. ...

"West and Chan et al. are useful for getting numerically stable
updating formulae for the regressions."

references...

John Alan McDonald and Art B. Owen, "Smoothing with Split Linear
Fits", LCS Technical Report No. 7, SLAC-PUB 3423, AD-A149032,
Laboratory for Computational Statistics, Dept. of Statistics, Stanford
University, July 1984.

West, D.H.D., "Updating Mean and Variance Estimates: An Improved
Method", Communications of the ACM, v 22, no. 9, p 532-535 (1979).

Chan, T.F., Golub, G.H., and LeVeque, R.J., "Algorithms for Computing
the Sample Variance: Analysis and Recommendations", The American
Statistician, v 37, p 242-247 (1983).

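
The robust locally weighted regression described under "Lowness" above
can be sketched in Python as follows. This is only an illustration
under stated assumptions, not the code used by SMOOTH: the local
polynomial is taken to be a straight line, the neighbourhood weights
are tricube, and the robustness weights are bisquare with a scale of
six times the median absolute residual, which are the choices usually
associated with Cleveland's procedure. The names lowess_sketch, frac
and robust_passes are hypothetical; frac plays the role of the -f
value.

```python
import math


def tricube(u):
    """Neighbourhood weight: 1 at u = 0, falling smoothly to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def bisquare(u):
    """Robustness weight: near 1 for small residuals, 0 for |u| >= 1."""
    u = abs(u)
    return (1.0 - u * u) ** 2 if u < 1.0 else 0.0


def weighted_line_value(x, y, w, x0):
    """Value at x0 of the weighted least-squares line (None if no weight)."""
    sw = sum(w)
    if sw <= 0.0:
        return None
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return my + slope * (x0 - mx)


def lowess_sketch(x, y, frac=0.5, robust_passes=2):
    """Smooth y on x (x strictly increasing); return the fitted values."""
    n = len(x)
    if n < 2:
        return list(y)
    q = min(n, max(2, math.ceil(frac * n)))  # points used per fitted value
    robust = [1.0] * n                       # robustness weights, all 1 at first
    fitted = list(y)
    for sweep in range(robust_passes + 1):
        new_fit = []
        for k in range(n):
            # h is the distance from x[k] to its q-th nearest data point.
            h = sorted(abs(xi - x[k]) for xi in x)[q - 1]
            w = [tricube((xi - x[k]) / h) * ri for xi, ri in zip(x, robust)]
            value = weighted_line_value(x, y, w, x[k])
            # If every weight vanished, keep the value from the previous sweep.
            new_fit.append(fitted[k] if value is None else value)
        fitted = new_fit
        if sweep == robust_passes:
            break
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        s = sorted(abs(r) for r in resid)[n // 2]  # (upper) median |residual|
        if s == 0.0:
            break                                  # nothing left to downweight
        robust = [bisquare(r / (6.0 * s)) for r in resid]
    return fitted
```

Calling lowess_sketch(xs, ys, frac=0.2) corresponds roughly to the
intent of "-f .2". Each fitted value scans all n points, so the cost
grows at least as the square of the number of points.
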
IMPLEMENTATION

The implementation of the split linear smoothing is based on
pseudocode by Art Owen.

The arrays take a lot of space. For n points, the number of doubles is
approximately 38*n, plus 2*n for general curve, plus 2*n for 3D case.
For 100 points and 8 byte doubles, this means at least 8*38*100=30400
bytes.

Execution time...

The program will employ a numeric coprocessor if it is available, but
will run correctly without it.

Time for "lowness" is proportional to the square of the number of data
points. 101 points took 151 seconds on a 7.5 MHz V-20, with no 8087,
but only 0.98 seconds on a 20 MHz 80386 with an 80387.

Time for split linear smoothing increases slightly faster than
linearly in the number of data points.

The updating formulas mentioned by Art Owen are not used in this
program. The selection of window sizes (a geometric sequence) is my
own. -JVZ

EXAMPLES

The file ROUGH contains data points from sin(x) with one abrupt phase
reversal (creating a discontinuity) and some added noise. To see the
effect of the two algorithms, try

    C>smooth rough -f .2 >rlow
    C>smooth rough -s >rsl

Then display all three files with GRAPH...

    C>graph rough rlow rsl -m -32 10 20

Note how the split linear smooth preserved the discontinuity whereas
"lowness" smoothed it out somewhat.

The file SP contains points from a general curve...

    C>smooth sp -f .2 -c >splow
    C>smooth sp -s -c >spsl
    C>graph sp splow spsl -m 1 10 20

This input file has a discontinuity in the first derivative which the
split linear smooth was able to preserve.

AUTHOR

Copyright (c) 1987, 1991 by James R. Van Zandt
(jrv@mbunix.mitre.org), 27 Spencer Dr., Nashua NH 03062, 603-888-2272.
Resale forbidden, copying for personal use encouraged. Constructive
comments welcome.
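
For readers who do not have the ROUGH file mentioned under EXAMPLES, a
stand-in with the same described character (sin(x), one abrupt phase
reversal, added noise) could be generated along the following lines.
Everything specific here is an assumption made for illustration: the
function names make_rough and format_points, the number of points, the
x range, the position of the reversal, and the noise level are not
taken from the original file.

```python
import math
import random


def make_rough(n=101, x_max=4.0 * math.pi, x_break=None, noise=0.1, seed=1):
    """Return n (x, y) pairs: noisy sin(x) with its sign flipped for x >= x_break."""
    if x_break is None:
        # Assumed break position: sin(x_break) is nonzero there, so the
        # phase reversal shows up as a jump in value.
        x_break = 0.375 * x_max
    rng = random.Random(seed)          # fixed seed gives repeatable noise
    points = []
    for i in range(n):
        x = x_max * i / (n - 1)        # strictly increasing abscissas
        y = math.sin(x) if x < x_break else -math.sin(x)
        points.append((x, y + rng.gauss(0.0, noise)))
    return points


def format_points(points):
    """One "x y" pair per line, the layout SMOOTH reads by default."""
    return "\n".join("%.6f %.6f" % p for p in points)
```

Writing format_points(make_rough()) to a file named rough would then
allow the two commands from the EXAMPLES section to be tried on
comparable data.
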